
    Atypicity Detection in Data Streams: a Self-Adjusting Approach

    Outlyingness is a subjective concept that depends on the isolation level of a record or set of records. Clustering-based outlier detection aims to cluster the data and to detect outliers based on the characteristics of the clusters (e.g. small, tight and/or dense clusters may be considered outliers). Existing methods require a parameter standing for the "level of outlyingness", such as the maximum size or a percentage of small clusters, in order to build the set of outliers. Unfortunately, setting this parameter manually is not feasible in a streaming environment, given the fast response times usually required. In this paper we propose WOD, a method that separates outliers from clusters thanks to a natural and effective principle. The main advantages of WOD are its ability to automatically adjust to any clustering result and the fact that it is parameterless.
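
    To make the idea of a parameter-free separation concrete, here is a minimal sketch that splits clusters into "kept" and "outlier" clusters at the largest relative drop in the sorted cluster sizes. The criterion is an assumption chosen for illustration only, not the actual WOD principle.

```python
# Illustrative sketch only: a parameter-free cut between "large" clusters and
# small outlying clusters, based on the largest relative gap in sorted cluster
# sizes. This is NOT the actual WOD criterion, just one way such a separation
# could be automated without a user-set threshold.

def split_outlier_clusters(cluster_sizes):
    """Return (kept_ids, outlier_ids) given a dict {cluster_id: size}."""
    ranked = sorted(cluster_sizes.items(), key=lambda kv: kv[1], reverse=True)
    sizes = [s for _, s in ranked]
    if len(sizes) < 2:
        return [cid for cid, _ in ranked], []
    # Place the cut where the relative drop between consecutive sizes is largest.
    drops = [(sizes[i] - sizes[i + 1]) / sizes[i] for i in range(len(sizes) - 1)]
    cut = max(range(len(drops)), key=drops.__getitem__) + 1
    kept = [cid for cid, _ in ranked[:cut]]
    outliers = [cid for cid, _ in ranked[cut:]]
    return kept, outliers

if __name__ == "__main__":
    sizes = {"A": 520, "B": 480, "C": 455, "D": 9, "E": 4, "F": 2}
    kept, outliers = split_outlier_clusters(sizes)
    print("kept:", kept)          # ['A', 'B', 'C']
    print("outliers:", outliers)  # ['D', 'E', 'F']
```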

    Web Usage Mining : extraction de périodes denses à partir des logs

    Existing Web Usage Mining techniques currently rely either on an arbitrary partitioning of the data over time (e.g. "one log per month") or on a partitioning guided by expected results (e.g. "what is customer behaviour during the Christmas shopping period?"). These approaches suffer from two problems. First, they depend on this arbitrary organisation of the data over time. Second, they cannot automatically extract "seasonal peaks" from the stored data. We propose to exploit the data in order to automatically discover "dense" periods of behaviour. A period is considered "dense" if it contains at least one sequential pattern that is frequent over the set of users connected to the site during that period.
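
    A minimal sketch of the "dense period" idea follows, under simplifying assumptions (only length-2 ordered patterns, fixed non-overlapping windows); the actual approach mines full sequential patterns from the logs.

```python
# Minimal sketch (not the paper's algorithm): call a time window "dense" if at
# least one ordered pair of pages is a frequent sequential pattern among the
# users active in that window. Real approaches mine full sequential patterns;
# length-2 patterns are used here only to keep the example short.

from collections import defaultdict
from itertools import combinations

def dense_windows(sessions, window, min_support):
    """sessions: list of (timestamp, user, [page, ...]) navigation sequences."""
    if not sessions:
        return []
    start = min(t for t, _, _ in sessions)
    end = max(t for t, _, _ in sessions)
    dense = []
    t = start
    while t <= end:
        active = [(u, pages) for ts, u, pages in sessions if t <= ts < t + window]
        users = {u for u, _ in active}
        if users:
            # Support of an ordered page pair = fraction of active users whose
            # session contains the two pages in that order.
            support = defaultdict(set)
            for u, pages in active:
                for a, b in combinations(pages, 2):  # preserves session order
                    support[(a, b)].add(u)
            if any(len(us) / len(users) >= min_support for us in support.values()):
                dense.append((t, t + window))
        t += window
    return dense

if __name__ == "__main__":
    logs = [
        (0, "u1", ["home", "promo", "cart"]),
        (1, "u2", ["home", "promo", "cart"]),
        (2, "u3", ["faq"]),
        (10, "u4", ["home"]),
        (11, "u5", ["blog"]),
    ]
    print(dense_windows(logs, window=5, min_support=0.6))  # [(0, 5)]
```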

    Can we assess mental health through social media and smart devices? Addressing bias in methodology and evaluation

    Predicting mental health from smartphone and social media data on a longitudinal basis has recently attracted great interest, with very promising results being reported across many studies. Such approaches have the potential to revolutionise mental health assessment, if their development and evaluation follow a real-world deployment setting. In this work we take a closer look at state-of-the-art approaches, using different mental health datasets and indicators, different feature sources and multiple simulations, in order to assess their ability to generalise. We demonstrate that, under a pragmatic evaluation framework, none of the approaches deliver or even approach the reported performances. In fact, we show that current state-of-the-art approaches can barely outperform the most naive baselines in the real-world setting, posing serious questions not only about their deployment ability, but also about the contribution of the derived features to the mental health assessment task and how to make better use of such data in the future.
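
    The kind of check the abstract argues for can be illustrated as follows, on synthetic data and with an assumed setup (this is not the authors' evaluation code or datasets): a longitudinal predictor is trained on the past only and compared against a naive carry-forward baseline.

```python
# Assumed, illustrative evaluation sketch: chronological split plus a naive
# "carry the last observed score forward" baseline for a longitudinal
# mental-health score predictor. Toy data; not the paper's experiments.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Toy longitudinal data: daily feature vectors and a slowly drifting score.
n_days, n_features = 200, 10
X = rng.normal(size=(n_days, n_features))
score = np.cumsum(rng.normal(scale=0.3, size=n_days)) + 5.0

split = int(0.7 * n_days)                     # train on the past only
X_tr, X_te = X[:split], X[split:]
y_tr, y_te = score[:split], score[split:]

model = Ridge().fit(X_tr, y_tr)
model_mae = mean_absolute_error(y_te, model.predict(X_te))

# Naive baseline: predict each day with the previous day's observed score.
baseline_pred = score[split - 1:-1]
baseline_mae = mean_absolute_error(y_te, baseline_pred)

print(f"model MAE    : {model_mae:.3f}")
print(f"baseline MAE : {baseline_mae:.3f}")   # often hard to beat on such data
```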

    Scalable Similarity Matching in Streaming Time Series

    Nowadays, online monitoring of data streams is essential in many real-life applications, such as sensor network monitoring, manufacturing process control, and video surveillance. One major problem in this area is the online identification of streaming sequences similar to a predefined set of pattern-sequences. In this paper, we present a novel solution that extends the state of the art both in terms of effectiveness and efficiency. We propose the first online similarity matching algorithm based on the Longest Common SubSequence (LCSS) that is specifically designed to operate in a streaming context and that can effectively handle time scaling as well as noisy data. In order to deal with high stream rates and multiple streams, we extend the algorithm to operate on multilevel approximations of the streaming data, thereby quickly pruning the search space. Finally, we incorporate error estimation mechanisms into our approach in order to reduce the number of false negatives. We perform an extensive experimental evaluation using forty real datasets, diverse in nature and characteristics, and we also compare our approach to previous techniques. The experiments demonstrate the validity of our approach. The original publication is available in the PAKDD 2012 proceedings, Lecture Notes in Artificial Intelligence (LNAI), Springer Verlag (www.springerlink.com).
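
    A minimal sketch of LCSS-based matching under an epsilon matching threshold, applied to the most recent window of a stream, is given below. The brute-force dynamic program and the parameter values are assumptions for illustration, not the paper's optimized streaming algorithm.

```python
# Minimal sketch of LCSS-based similarity matching (assumed parameters): two
# values match if they differ by at most eps; the similarity of a pattern with
# the latest stream window is the LCSS length normalized by the shorter length.

def lcss(a, b, eps):
    """Length of the Longest Common SubSequence under an eps matching threshold."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def similarity(pattern, stream, window, eps):
    """LCSS similarity in [0, 1] between the pattern and the latest window."""
    recent = stream[-window:]
    return lcss(pattern, recent, eps) / min(len(pattern), len(recent))

if __name__ == "__main__":
    pattern = [1.0, 2.0, 3.0, 4.0]
    stream = [0.0, 1.1, 2.1, 9.0, 3.1, 4.0, 0.5]
    print(similarity(pattern, stream, window=6, eps=0.2))  # 1.0 despite noise
```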

    Extraction de motifs séquentiels dans les flux de données

    In recent years, many applications dealing with data generated continuously and at high speed have emerged. Such data are now referred to as data streams. Dealing with potentially infinite quantities of data imposes constraints that raise many processing problems, such as the inability to block the data stream and the need to produce results in real time. Nevertheless, many application areas (such as bank transactions, Web usage, network monitoring, etc.) have attracted a lot of interest in both industry and academia. These potentially infinite quantities of data rule out any hope of complete storage; we nevertheless need to be able to examine the history of the data streams. This led to the compromise of "summaries" of the data stream and "approximate" results. Today, a large number of different types of data stream summaries have been proposed. However, continuous developments in technology and in the corresponding applications demand similar progress in summary and analysis methods. Moreover, sequential pattern extraction is still little studied: when this thesis began, there were no methods for extracting sequential patterns from data streams. Motivated by this context, we are interested in a method that summarizes the data stream in an efficient and reliable way and whose main purpose is the extraction of sequential patterns. In this thesis, we propose the CLUSO (Clustering, Summarizing and Outlier detection) approach. CLUSO allows us to obtain clusters from a stream of sequences of itemsets, to compute and maintain histories of these clusters, and to detect outliers. The contributions detailed in this report concern:
    - Clustering sequences of itemsets in data streams. To the best of our knowledge, this is the first work in this domain.
    - Summarizing data streams by means of sequential pattern extraction. The summaries produced by CLUSO consist of aligned sequential patterns representing clusters, associated with their history in the stream. The set of such patterns is a reliable summary of the stream at time t. Managing the history of these patterns is a crucial point in stream analysis; with CLUSO we introduce a new way of managing time granularity in order to optimize this history.
    - Outlier detection. When it concerns data streams, this detection must be fast and reliable. More precisely, stream constraints forbid requiring parameters or adjustments from the end user (ignored outliers, or their late detection, can be detrimental). Outlier detection in CLUSO is automated and self-adjusting.
    We also present a case study on real data, carried out in collaboration with Orange Labs.
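
    One common way to keep such a history compact is to store recent snapshots at fine granularity and merge older ones into coarser buckets. The sketch below shows this general device; it is an assumed illustration, not necessarily the time-granularity scheme introduced in CLUSO.

```python
# Hedged sketch of history management with decreasing time granularity (a
# common device for stream summaries; not necessarily CLUSO's scheme): recent
# snapshots of a cluster summary are kept at fine granularity, older ones are
# merged into coarser buckets, so the history stays logarithmic in stream length.

class TiltedHistory:
    def __init__(self, capacity_per_level=2):
        self.capacity = capacity_per_level
        self.levels = [[]]            # levels[0] = finest granularity, newest first

    def add(self, snapshot):
        """Append the latest snapshot and cascade merges to coarser levels."""
        self.levels[0].insert(0, snapshot)
        level = 0
        while len(self.levels[level]) > self.capacity:
            oldest_two = self.levels[level][-2:]
            del self.levels[level][-2:]
            merged = self._merge(oldest_two)
            if level + 1 == len(self.levels):
                self.levels.append([])
            self.levels[level + 1].insert(0, merged)
            level += 1

    @staticmethod
    def _merge(snapshots):
        # Here a snapshot is just a support count; merging sums the counts.
        return sum(snapshots)

if __name__ == "__main__":
    h = TiltedHistory()
    for support in [3, 5, 2, 4, 6, 1, 7, 2]:
        h.add(support)
    print(h.levels)   # [[2, 7], [7], [14]] -- older counts merged into coarser buckets
```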

    Mining Data Streams for Frequent Sequences Extraction

    In recent years, emerging applications have introduced new constraints for data mining methods. These constraints are particularly linked to new kinds of data that can be considered complex data. One typical kind of such data is known as data streams. In data stream processing, memory usage is restricted, new elements are generated continuously and have to be considered as fast as possible, no blocking operator can be performed and the data can be examined only once. At this time and to the best of our knowledge, no method has been proposed for mining sequential patterns in data streams. We argue that the main reason is the combinatorial phenomenon related to sequential pattern mining. In this paper, we propose an algorithm based on sequence alignment for mining approximate sequential patterns in Web usage data streams. To meet the one-scan constraint, a greedy clustering algorithm combined with an alignment method is proposed. We show that our proposal is able to extract relevant sequences with very low thresholds.
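
    A heavily simplified sketch of alignment-based approximate patterns follows: the sequences of a cluster are combined position by position, keeping only the items that appear at a position in enough sequences. Real sequence alignment also handles gaps and shifts, which are ignored here for brevity; the support threshold is an assumption for the example.

```python
# Heavily simplified, position-wise "alignment" of itemset sequences (no gap
# handling, unlike real sequence alignment): items kept in the aligned pattern
# are those appearing at a position in at least `min_support` of the sequences.

from collections import Counter
from itertools import zip_longest

def aligned_pattern(sequences, min_support=0.5):
    """sequences: list of sequences, each a list of itemsets (sets of items)."""
    n = len(sequences)
    pattern = []
    # zip_longest walks positions; missing itemsets are treated as empty.
    for position in zip_longest(*sequences, fillvalue=frozenset()):
        counts = Counter(item for itemset in position for item in itemset)
        kept = {item for item, c in counts.items() if c / n >= min_support}
        if kept:
            pattern.append(kept)
    return pattern

if __name__ == "__main__":
    cluster = [
        [{"home"}, {"search", "promo"}, {"cart"}],
        [{"home"}, {"search"}, {"cart", "pay"}],
        [{"home"}, {"search"}, {"cart"}],
    ]
    print(aligned_pattern(cluster, min_support=0.6))
    # [{'home'}, {'search'}, {'cart'}]  ('promo' and 'pay' fall below support)
```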

    Mining Sequential Patterns from Data Streams: a Centroid Approach

    In recent years, emerging applications have introduced new constraints for data mining methods. These constraints are typical of a new kind of data: data streams. In data stream processing, memory usage is restricted, new elements are generated continuously and have to be considered in linear time, no blocking operator can be performed and the data can be examined only once. At this time, only a few methods have been proposed for mining sequential patterns in data streams. We argue that the main reason is the combinatorial phenomenon related to sequential pattern mining. In this paper, we propose an algorithm based on sequence alignment for mining approximate sequential patterns in Web usage data streams. To meet the one-scan constraint, a greedy clustering algorithm combined with an alignment method is proposed. We show that our proposal is able to extract relevant sequences with very low thresholds.
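
    To complement the alignment sketch above, here is a sketch of a greedy, single-pass clustering of incoming sequences around centroids. The set-based similarity and the threshold are assumptions kept simple for illustration; the actual method relies on sequence alignment rather than the similarity used here.

```python
# Sketch of greedy, single-pass centroid clustering for streaming sequences
# (assumed similarity and threshold, not the paper's): each incoming sequence
# joins the most similar existing cluster or starts a new one, and is never
# revisited, so each element of the stream is examined only once.

def items_of(sequence):
    """Flatten a sequence of itemsets into a single set of items."""
    return {item for itemset in sequence for item in itemset}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

class GreedyCentroidClustering:
    def __init__(self, min_sim=0.5):
        self.min_sim = min_sim
        self.centroids = []      # one representative item-set per cluster
        self.members = []        # sequences assigned to each cluster

    def add(self, sequence):
        """Single pass: assign the sequence and return its cluster index."""
        items = items_of(sequence)
        best, best_sim = None, 0.0
        for idx, centroid in enumerate(self.centroids):
            sim = jaccard(items, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is not None and best_sim >= self.min_sim:
            self.centroids[best] |= items     # crude centroid update
            self.members[best].append(sequence)
            return best
        self.centroids.append(set(items))
        self.members.append([sequence])
        return len(self.centroids) - 1

if __name__ == "__main__":
    clu = GreedyCentroidClustering(min_sim=0.4)
    print(clu.add([{"home"}, {"search"}, {"cart"}]))      # 0 (new cluster)
    print(clu.add([{"home"}, {"search", "promo"}]))       # 0
    print(clu.add([{"blog"}, {"comments"}]))              # 1 (new cluster)
```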